Abstract


Suppose that a large number of forms of the same test are administered
to the same group of examinees, each form consisting of a random sample of
items drawn from a common pool of items. If some test statistic is computed
separately for each form of the test, the value obtained will (ignoring
practice effect, fatigue, etc.) differ from form to form because of
sampling fluctuations. The standard deviation of the values obtained
represents, approximately, the standard error of the test statistic when
the test items are sampled.

Formulas for such standard errors are here derived for a) the test
score of a single examinee, b) the mean test score of a group of examinees,
c) the standard deviation of the scores of the group, d) the
Kuder-Richardson reliability of the test, formula 20, e) the Kuder-Richardson
reliability, formula 21, f) the test validity. In large
samples, the foregoing statistics (with the possible exception of d) are
approximately normally distributed, so that significance tests can be made
by familiar procedures.

Consideration is given to the relation of certain of the foregoing
standard errors to the conventional standard error of measurement, to
the Kuder-Richardson reliability coefficients 20 and 21, and to the
Wilks-Votaw criterion for parallel tests. Practical applications of the results
are briefly discussed. In particular, it is concluded that the Kuder-Richardson
formula-21 reliability coefficient should properly be used in
certain practical situations instead of the commonly preferred formula-20
coefficient.


THE STANDARD ERRORS OF VARIOUS TEST STATISTICS
WHEN THE TEST ITEMS ARE SAMPLED*

Frederic M. Lord

Suppose that the same test is administered to a large number of
separate groups of examinees, the groups being random samples all drawn
from the same population; and suppose that some test statistic is computed
separately for each sample of examinees. The value obtained for
this test statistic will, of course, differ from sample to sample because
of sampling fluctuations. The standard deviation of these values
over a very large number of samples is the standard error of the test
statistic when examinees are sampled. For convenience, this type of
sampling will be referred to as type 1 sampling.

On the other hand, suppose that a large number of forms of the
same test are administered to the same group of examinees, each form
consisting of a random sample of items drawn from a common population
of items; and suppose that some test statistic is computed separately
for each form of the test. Let us assume for theoretical purposes
that the examinees do not change in any way during the course of testing,
i.e., that there is no practice effect, no fatigue, etc. The value
computed for the test statistic will still, of course, differ from form
to form because of sampling fluctuations. The standard deviation of
these values over a very large number of samples is the standard error
of the test statistic when the test items are sampled. This type of
sampling will be referred to as type 2 sampling. Test forms constructed
by type 2 sampling will be called randomly parallel forms or
randomly parallel tests.

The writer is indebted to Professor S. S. Wilks who has checked
over certain critical portions of a draft of this paper.
Type 1 standard error formulas have long been available and are
sometimes incorrectly used in situations where sampling of test items
is of crucial importance. The present paper is concerned with deriving
formulas for the type 2 standard errors of certain test statistics.
Formulas for the two kinds of standard errors may usually be readily
distinguished on a superficial level by the following characteristics,
which underscore the essential difference between them: type 1 standard
errors are usually obviously proportional to some power (positive
or negative) of the number of examinees in the sample, most commonly
inversely proportional to the square root of this number, and are
usually much less obviously and simply related, if at all, to the
number of items in the test; type 2 standard errors have the corresponding
characteristic with respect to $n$, the number of items in
the sample.

Notation  and  Summary  of  Formulas 

The test statistics with which the present study is concerned
are primarily the following:

$t_a$ -- the observed test score of examinee $a$, obtained by
counting the number of items answered correctly on a
single test.

$\bar{t}$ -- the mean of the scores obtained by the $N$ examinees on
a single test. $\bar{t} = \sum_a t_a / N$.

$s_t$ -- the standard deviation of the scores obtained by the $N$
examinees on a single test. $s_t^2 = \sum_a t_a^2 / N - \bar{t}^2$.

$r_{21}$ -- the Kuder-Richardson reliability coefficient, formula 21.

$r_{20}$ -- the Kuder-Richardson reliability coefficient,
formula 20. $r_{20} = \frac{n}{n-1}\left(1 - \sum_i s_i^2 / s_t^2\right)$ (symbols explained
in the succeeding list).

$r_{ct}$ -- the correlation of the test score with any external
variable, $c$. $r_{ct} = s_{ct} / (s_c s_t)$.


Considerable care in defining notation must be taken here in order
to avoid serious confusion. Additional symbols that will be used are
listed below for easy reference.

$x_{ia}$ -- the "score" of examinee $a$ on item $i$. $x_{ia} = 1$ if
the item is answered correctly; $x_{ia} = 0$ otherwise.

$n$ -- the number of items in a single form of a test, i.e., in
a single sample. The subscript $i$ runs from 1 to $n$.

$N$ -- the number of examinees in a single group of examinees.
The subscript $a$ runs from 1 to $N$.

$m$ -- the number of items in a finite population of items.

$p_i$ -- the observed "difficulty" of item $i$ for the $N$ examinees
tested. $p_i = \sum_a x_{ia} / N$.

$q_i = 1 - p_i$.

$z_a$ -- the "proportion-correct score" of examinee $a$; the proportion
of the items in a single test answered correctly by
examinee $a$. $z_a = t_a / n$.

$\bar{z}$, $\bar{c}$, etc. -- the mean of the $N$ values of $z_a$, $c_a$, etc.
$\bar{z} = \sum_a z_a / N$, etc.

$M(p)$ -- the mean of the $n$ observed values of $p_i$ for the
items in the test administered. $M(p) = \sum_i p_i / n$.

$s_c$, $s_z$, etc. -- the standard deviation of the $N$ values of
$c_a$, $z_a$, etc. $s_z^2 = \sum_a z_a^2 / N - \bar{z}^2$, etc.

$s_i$ -- the standard deviation of $x_{ia}$ for fixed $i$.
$s_i^2 = \sum_a x_{ia}^2 / N - \left(\sum_a x_{ia} / N\right)^2 = p_i q_i$.

$s_{ct}$, etc. -- the covariance (over examinees) of $c$ and $t$,
etc. $s_{ct} = s_c s_t r_{ct} = \sum_a (c_a - \bar{c})(t_a - \bar{t}) / N$.

$s_{ic}$, $s_{iz}$, $s_{it}$ -- the covariance (over examinees) of $c_a$,
$z_a$, or $t_a$, respectively, with $x_{ia}$, for fixed $i$.
$s_{it} = s_i s_t r_{it} = \sum_a (x_{ia} - p_i)(t_a - \bar{t}) / N$.

$s(p)$ -- the standard deviation of the $n$ observed values of
$p_i$ for the $n$ items in the test administered.
$s^2(p) = \sum_i p_i^2 / n - M^2(p)$.

$s(s_{iz})$, $s(s_{it})$, etc. -- the standard deviation of the $n$
observed values of $s_{iz}$, $s_{it}$, etc., for the $n$ items in
the test administered.
$s^2(s_{it}) = \sum_i s_{it}^2 / n - \left(\sum_i s_{it} / n\right)^2$.

$s(s_{ic}, s_{it})$ -- the covariance (over items) of $s_{ic}$ and $s_{it}$.
$s(s_{ic}, s_{it}) = \sum_i s_{ic} s_{it} / n - \left(\sum_i s_{ic} / n\right)\left(\sum_i s_{it} / n\right)$.

$r_{ic}$, $r_{it}$, $r_{iz}$ -- the correlation of $c_a$, $t_a$, or $z_a$,
respectively, with $x_{ia}$, for fixed $i$.
$r_{it} = s_{it} / (s_i s_t)$.


It should be noted that all the statistics in the foregoing list
are observed sample statistics relating to a given sample. There are
two kinds of statistics listed, typified, in the simplest case, by
$\bar{z} = \sum_a z_a / N$ and $M(p) = \sum_i p_i / n$. Population parameters have not been
listed but will be designated, when needed, by the use of Greek letters.
The following additional symbols, relating to the totality of all possible
samples of test items (type 2 sampling), will be used.

$E(x)$ -- the expected value of $x$; the arithmetic mean of the
statistic $x$ over all possible samples.

S.E.$(x)$ -- the standard error of the statistic $x$; the standard
deviation of the statistic $x$ over all possible samples.
S.E.$^2(x) = E(x^2) - [E(x)]^2$.

Var $x$ -- the sampling variance. Var $x$ = S.E.$^2(x)$.

Cov$(x,y)$ -- the sampling covariance of the statistics $x$ and $y$
over all possible samples. Cov$(x,y) = E(xy) - E(x)E(y)$.
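
The observed-sample notation can be made concrete with a small sketch. The 0/1 response matrix below is hypothetical (it is not data from the paper), and the variable names are illustrative only; exact fractions are used so that the identities $s_i^2 = p_i q_i$ and $\bar{z} = M(p)$ can be checked without rounding.

```python
from fractions import Fraction

# Hypothetical 0/1 response matrix (an assumption for illustration):
# rows are items i = 1..n, columns are examinees a = 1..N.
x = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
]
n = len(x)      # number of items
N = len(x[0])   # number of examinees


def mean(values):
    """Exact arithmetic mean."""
    values = list(values)
    return sum(values, Fraction(0)) / len(values)


def variance(values):
    """Variance with divisor equal to the number of values, as in the notation list."""
    values = list(values)
    mu = mean(values)
    return mean(v * v for v in values) - mu * mu


t = [sum(x[i][a] for i in range(n)) for a in range(N)]   # t_a, number-right scores
z = [Fraction(t_a, n) for t_a in t]                      # z_a = t_a / n
p = [mean(x[i]) for i in range(n)]                       # p_i, item difficulties
s_i_sq = [variance(x[i]) for i in range(n)]              # s_i^2 for each item

t_bar, z_bar = mean(t), mean(z)
s_t_sq = variance(t)                                     # s_t^2
M_p, s_p_sq = mean(p), variance(p)                       # M(p) and s^2(p)

print("t_a:", t, " mean:", t_bar, " s_t^2:", s_t_sq)
print("M(p):", M_p, " s^2(p):", s_p_sq)
```

Both the examinee-side statistics ($\bar{z}$, $s_t$) and the item-side statistics ($M(p)$, $s(p)$) come from the same matrix, summed in the two different directions.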
Table 1 summarizes the more important of the type 2 standard errors
derived in the present paper. For purposes of comparison, the last
column of the table, when appropriate, gives the corresponding usual
type 1 formula for the standard error for the case where the test
scores are assumed to be normally distributed. The standard error
formulas in both columns are large-sample formulas, in general, and
observable sample statistics have been substituted for the corresponding
population values throughout.

At a later point it will be proven that in large samples of the
second type all of the test statistics in the left-hand column of Table
1, with the possible exception of the Kuder-Richardson formula 20
reliability coefficient, have an asymptotically normal sampling distribution.


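Table 1

Standard Errors of Test Statistics

[The body of Table 1 is not recoverable from the source. Its left-hand column lists the test statistics and its last column gives, when appropriate, the corresponding type 1 standard error; one type 1 entry reads "Not known to writer."]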
Illustrative Examples and Discussion of the Standard Errors


Suppose that Form A of a certain 135-item test has been administered.
Several parallel forms of this same test are to be administered in the
future. Each form is administered to a different group of examinees.
The groups of examinees may be considered as random samples drawn from
the same population. Each group is so large that differences between
groups due to sampling of examinees may be ignored.* It is found that
the mean, standard deviation, and Kuder-Richardson formula 20 reliability
of the scores on Form A are 63.5, 21.5, and 0.95, respectively. How much
may we expect the means to vary from form to form?


The required value of $s(p)$ can be determined directly from item
analysis data; or it can be calculated from the three numerical values
given by solving for $s^2(p)$ Tucker's modification (8) of the equation
for the Kuder-Richardson formula 20 reliability, the result being:

$$s^2(p) = \frac{\bar{t}(n - \bar{t})}{n^2} - \frac{s_t^2}{n}\left(1 - \frac{n-1}{n}\, r_{20}\right). \qquad (1)$$

We find that $s^2(p) = .0538$.

The large-sample estimate of the type 2 standard error of the mean
is found to be S.E.$_2(\bar{t}) = 2.7$. (The subscript "2" is used here, and
the subscript "1" is used below, to indicate type 2 and type 1 standard
errors, respectively. Hereafter, type 2 sampling will be understood,
unless otherwise specifically indicated.) If the same test were administered
to random groups of 155 examinees, the type 1 standard error
would be S.E.$_1(\bar{t}) = 1.8$.


*Useful formulas for dealing simultaneously with sampling of items
and sampling of examinees have been developed by the writer for certain
of the statistics studied here. Some such formulas are recently independently
reported in Hooke, R., "Sampling from a matrix, with applications
to the theory of testing." Princeton University Statistical Research
Group, Memorandum Report 53, 1953. (Dittoed.)

On the basis of the foregoing, we may expect that parallel forms of
the test would not differ from each other in mean score by as much as
$2\sqrt{2}\,$S.E.$_2(\bar{t}) = 7.6$ points more than one time in twenty. If the parallel
forms are carefully constructed by matching items from form to form on
difficulty and item-test correlation rather than by random sampling of
items, it may well be that the forms will not differ from each other
as much as the foregoing formulas would indicate. On the other hand,
it is not unlikely that supposedly parallel forms of a test may, because
of the unconscious bias of the test constructor, often be found in fact
to be less parallel than would be expected if each form were a random
sample of test items.
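
The arithmetic of this example can be retraced with a short sketch. It assumes that Tucker's modification of formula 20 has the form $r_{20} = \frac{n}{n-1}\left[1 - \{\bar{t}(1 - \bar{t}/n) - n s^2(p)\}/s_t^2\right]$, which is then solved for $s^2(p)$, and that the large-sample type 2 standard error of the mean is $\sqrt{n}\, s(p)$, as follows from S.E.$^2(\bar{z}) = \sigma^2(p)/n$ with $\bar{t} = n\bar{z}$. Variable names are illustrative.

```python
import math

# Values quoted in the text for Form A.
n = 135        # number of items
t_bar = 63.5   # mean score
s_t = 21.5     # standard deviation of scores
r20 = 0.95     # Kuder-Richardson formula 20 reliability

# Variance of item difficulties, from formula 20 solved for s^2(p)
# (assumed form of Tucker's modification; see the lead-in).
s_p_sq = t_bar * (n - t_bar) / n**2 - (s_t**2 / n) * (1 - (n - 1) * r20 / n)

# Large-sample type 2 standard error of the mean number-right score.
se2_mean = math.sqrt(n * s_p_sq)

# A difference between two forms has standard error sqrt(2) * S.E.;
# two such standard errors give the "one time in twenty" bound.
bound = 2 * math.sqrt(2) * se2_mean

print(round(s_p_sq, 4), round(se2_mean, 1), round(bound, 1))
```

Under these assumptions the script reproduces the three figures quoted in the text: .0538, 2.7, and 7.6.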

In many kinds of statistical experiments it is commonly not merely
desirable but actually necessary to select cases by random sampling
rather than by stratified sampling, even though random sampling gives
rise to larger sampling fluctuations. The reason is, first, that random
sampling tends to avoid unintentional bias; and, second, that the standard
errors arising from random sampling are known and easily used,
whereas those arising from stratified sampling are often either unknown
or excessively cumbersome to use. Similarly, and for the same reasons,
it will be desirable in certain kinds of experimental work to use parallel
forms composed of items selected at random rather than in any
other way.
Suppose, for example, it is desired to investigate the relation of
length of reading passage to validity in a reading comprehension test.
The experimenter might well select at random from a pool of all available
reading items of some specified difficulty level (a) a sample of
all items based on passages containing more than 200 words and (b) a
sample based on passages containing less than 100 words (it is assumed
here that there is only one item per reading passage). He then places
these items in random order and administers them to a group of examinees,
obtaining separate scores for the long and for the short items. He computes
the validity of each score, using some available criterion. If
the two validity coefficients differ by little more than the type 2
standard error of their difference, it seems likely that the difference
is attributable to chance fluctuations due to the sampling of items.
If they differ by several times this standard error, the opposite conclusion
may be reached; insofar as other uncontrolled experimental variables
are ruled out, the difference may plausibly be attributed to length
of reading passage.

A note of caution is necessary in using the type 2 standard error
formulas. These formulas involve no assumptions beyond random sampling
and large $n$; however, it is not at present known just how large an $n$
is needed in any given case. The formulas in Table 1, therefore, should
be used with some caution. This is particularly true of the last three
rows of the table, since the correlation coefficients given in the first
column undoubtedly have sharply skewed distributions when $n$ is small.

It should, finally, be noted that the assumption of random sampling
of items cannot be expected to hold for speeded tests, for which the formulas
given in the present paper must be considered inapplicable.
Standard Errors of Measurement and Test Reliability

Table 1 gives a practical approximation to S.E.$(t_a)$ in terms of
observed sample statistics; the rigorously accurate value, as shown in
a later section, is

$$\text{S.E.}^2(t_a) = \frac{1}{n}\,\tau_a(n - \tau_a). \qquad (2)$$

Here $\tau_a = E(t_a)$ is the true score of examinee $a$, i.e., the expected
value* of $t_a$ over all randomly parallel forms of the test. The standard
error of the score of an examinee is the standard deviation of the
errors of measurement of his score (error of measurement $= t_a - \tau_a$).
The average of such standard deviations of errors of measurement over
all examinees,

$$\frac{1}{N}\sum_a \text{S.E.}^2(t_a) = \frac{1}{N}\sum_a E(t_a - \tau_a)^2, \qquad (3)$$

may appropriately be compared with the conventional "standard error of
measurement" of test theory. This latter, which will be denoted by
"S.E.Meas.," is likewise an average over all examinees. It is conventionally
defined by the formula

$$\text{S.E.Meas.} = s_t\sqrt{1 - \text{reliability}}. \qquad (4)$$


*The expectation symbol, $E$, denotes the average (arithmetic mean)
value over all type 2 samples. Thus the operator $E$ can be treated by
the same rules as a summation sign, so that $E(x + y) = E(x) + E(y)$,
$E\sum_a (t_a) = \sum_a E(t_a)$, $E(nt) = nE(t)$, $E(\tau_a) = \tau_a$, etc. By definition
$\tau_a = E(t_a)$, S.E.$^2[f(t)] = E[f(t) - E\{f(t)\}]^2 = E\{f(t)\}^2 - [E\{f(t)\}]^2$,
and Cov$[f_1(t), f_2(t)] = E[f_1(t) f_2(t)] - E[f_1(t)]\,E[f_2(t)]$, where $f(t)$
is any function of $t$.


Specifically, it will now be shown that the squared standard error
of measurement given by equation 3 is exactly equal to that which would
be expected in equation 4 if the test reliability there were given by
the Kuder-Richardson formula 21 in reference (5). In our notation,
this formula is

$$r_{21} = \frac{n}{n-1}\cdot\frac{s_t^2 - \bar{t}(1 - \bar{t}/n)}{s_t^2}. \qquad (5)$$


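Averaging equation 2 over all examinees, we find

$$\frac{1}{N}\sum_a \text{S.E.}^2(t_a) = \frac{1}{nN}\sum_a \tau_a(n - \tau_a) = \bar{\tau} - \frac{1}{nN}\sum_a \tau_a^2 = \bar{\tau} - \frac{1}{n}\left(\sigma_\tau^2 + \bar{\tau}^2\right), \qquad (6)$$

where $\bar{\tau}$ and $\sigma_\tau^2$ are the mean and variance of the true scores over the $N$ examinees.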

From (5) and (4), the expected value of the squared S.E.Meas. is

$$E[s_t^2(1 - r_{21})] = E\left[\frac{1}{n-1}\left(n\bar{t} - \bar{t}^2 - s_t^2\right)\right]. \qquad (7)$$
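In order to deal with (7) we first need expressions for $E(s_t^2)$
and $E(\bar{t}^2)$.

$$E(s_t^2) = \frac{1}{N}E\sum_a (t_a - \bar{t})^2 = \frac{1}{N}E\sum_a \left\{(t_a - \tau_a) + (\tau_a - \bar{\tau}) - (\bar{t} - \bar{\tau})\right\}^2. \qquad (8)$$

After squaring and rearranging $E$ and $\sum$ signs,

$$E(s_t^2) = \frac{1}{N}\Big[\sum_a E(t_a - \tau_a)^2 + \sum_a (\tau_a - \bar{\tau})^2 + N E(\bar{t} - \bar{\tau})^2 + 2\sum_a (\tau_a - \bar{\tau})E(t_a - \tau_a) - 2E\Big\{(\bar{t} - \bar{\tau})\sum_a (t_a - \tau_a)\Big\} - 2E\Big\{(\bar{t} - \bar{\tau})\sum_a (\tau_a - \bar{\tau})\Big\}\Big]. \qquad (9)$$

Now the fourth and the last terms on the right vanish since $E(t_a - \tau_a)$
and $\sum_a (\tau_a - \bar{\tau})$ both equal zero. It is seen that we have, term for term,

$$E(s_t^2) = \frac{1}{N}\sum_a \text{Var } t_a + \sigma_\tau^2 + \text{Var } \bar{t} + 0 - 2\,\text{Var } \bar{t} - 0. \qquad (10)$$

Now Var $t_a$ is given by (2), so that

$$E(s_t^2) = \frac{1}{nN}\sum_a \tau_a(n - \tau_a) + \sigma_\tau^2 - \text{Var } \bar{t}. \qquad (11)$$

Finally, proceeding as in (6), we have

$$E(s_t^2) = \bar{\tau} - \frac{\bar{\tau}^2}{n} + \frac{n-1}{n}\,\sigma_\tau^2 - \text{Var } \bar{t}. \qquad (12)$$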
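Next,

$$E(\bar{t}^2) = E[(\bar{t} - \bar{\tau}) + \bar{\tau}]^2 = E(\bar{t} - \bar{\tau})^2 + 2\bar{\tau}E(\bar{t} - \bar{\tau}) + E(\bar{\tau}^2) = \text{Var } \bar{t} + \bar{\tau}^2. \qquad (13)$$

From (7), (12), and (13),

$$E[s_t^2(1 - r_{21})] = \frac{1}{n-1}\left[n\bar{\tau} - \text{Var } \bar{t} - \bar{\tau}^2 - \bar{\tau} + \frac{\bar{\tau}^2}{n} - \frac{n-1}{n}\,\sigma_\tau^2 + \text{Var } \bar{t}\right] = \bar{\tau} - \frac{1}{n}\left(\sigma_\tau^2 + \bar{\tau}^2\right). \qquad (14)$$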


This result is the same as that in (6). We have shown that the average
squared standard error of measurement found in type 2 sampling is exactly
equal to the expected value of the squared S.E.Meas. derived from the
formula 21 Kuder-Richardson reliability coefficient.
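
This equality can be checked exactly on a toy case. The sketch below assumes a hypothetical pool of four equally frequent item types answered by three examinees, enumerates every ordered form of $n = 3$ items drawn with replacement (equivalent to random sampling from an infinite pool with these relative frequencies), and compares the average of $\frac{1}{n-1}(n\bar{t} - \bar{t}^2 - s_t^2)$, which equals $s_t^2(1 - r_{21})$, with the average of $\tau_a(n - \tau_a)/n$ over examinees.

```python
from fractions import Fraction
from itertools import product

# Hypothetical pool (an assumption for illustration): each tuple is one item
# type, giving the 0/1 responses of N = 3 examinees to that item.
pool = [(1, 1, 0), (1, 0, 0), (0, 1, 1), (1, 1, 1)]
n = 3                 # items per randomly parallel form
N = len(pool[0])      # number of examinees


def mean(values):
    """Exact arithmetic mean."""
    values = list(values)
    return sum(values, Fraction(0)) / len(values)


# True scores: tau_a = n times the proportion of pool items answered correctly.
tau = [n * mean(item[a] for item in pool) for a in range(N)]
tau_bar = mean(tau)
sigma_tau_sq = mean(v * v for v in tau) - tau_bar**2

# Average over examinees of the squared standard error tau_a (n - tau_a) / n.
avg_sq_error = mean(v * (n - v) / n for v in tau)


def sq_se_meas_21(form):
    """s_t^2 (1 - r_21) for one form, written as (n*t_bar - t_bar^2 - s_t^2)/(n - 1)."""
    t = [sum(item[a] for item in form) for a in range(N)]
    t_bar = mean(t)
    s_t_sq = mean(v * v for v in t) - t_bar**2
    return (n * t_bar - t_bar**2 - s_t_sq) / (n - 1)


# Expectation over all equally likely ordered forms (sampling with replacement).
expected = mean(sq_se_meas_21(form) for form in product(pool, repeat=n))

print(expected, avg_sq_error)
```

Writing $s_t^2(1 - r_{21})$ in the polynomial form avoids dividing by $s_t^2$, so forms on which all examinees tie cause no difficulty in the enumeration.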

The logical relation between Kuder-Richardson formulas 20 and 21
can be derived from equations 1 and 5, from which it is readily found
that

$$s_t^2(1 - r_{20}) = s_t^2(1 - r_{21}) - \frac{n^2}{n-1}\,s^2(p). \qquad (15)$$


Now the term on the left and the first term on the right of (15) are
the squared standard errors of measurement computed from $r_{20}$ and
from $r_{21}$, respectively. Furthermore, since $n s^2(p)/(n - 1)$ is the
best unbiased small-sample estimate of the population variance $\sigma^2(p)$,
it is seen that the last term on the right is the small-sample estimator
for the squared standard error of the mean score (see equation 22).
Consequently, we may rewrite (15) as

$$(\text{S.E.Meas.}_{20})^2 = (\text{S.E.Meas.}_{21})^2 - \text{S.E.}^2(\bar{t}). \qquad (16)$$

The difference between $r_{20}$ and $r_{21}$, as made apparent in
equation 16, arises from the fact that some randomly parallel forms
are, by chance, composed of harder-than-average items, or of
easier-than-average items; consequently, the mean of the actual scores on
any given test is not exactly equal to the mean of the true scores
for the same examinees. The use of $r_{20}$ is appropriate whenever one
is willing to ignore any difference between the mean test score of the
group and their mean true score, i.e., when one is concerned only with
the relative rather than the absolute size of the scores of the group.
On the other hand, $r_{21}$ should be used whenever one is concerned with
the actual magnitude of the errors of measurement, e.g., whenever there
is a predetermined cutting score which divides the examinees into
passing and failing groups.
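
A minimal sketch of how the two coefficients and their difference can be computed from item data follows. The response matrix is hypothetical, and the formulas coded are the usual ones: $r_{20} = \frac{n}{n-1}(1 - \sum_i p_i q_i / s_t^2)$ and $r_{21} = \frac{n}{n-1}\{1 - \bar{t}(n - \bar{t})/(n s_t^2)\}$. The last lines check that the two squared standard errors of measurement differ by exactly $n^2 s^2(p)/(n-1)$.

```python
from fractions import Fraction

# Hypothetical 0/1 response matrix: rows are items, columns are examinees.
x = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
]
n, N = len(x), len(x[0])


def mean(values):
    """Exact arithmetic mean."""
    values = list(values)
    return sum(values, Fraction(0)) / len(values)


t = [sum(x[i][a] for i in range(n)) for a in range(N)]   # number-right scores
p = [mean(row) for row in x]                             # item difficulties

t_bar = mean(t)
s_t_sq = mean(v * v for v in t) - t_bar**2               # variance of scores
s_p_sq = mean(v * v for v in p) - mean(p)**2             # variance of difficulties

sum_item_var = sum(p_i * (1 - p_i) for p_i in p)         # sum of p_i q_i
r20 = Fraction(n, n - 1) * (1 - sum_item_var / s_t_sq)
r21 = Fraction(n, n - 1) * (1 - t_bar * (n - t_bar) / (n * s_t_sq))

# Squared standard errors of measurement under each coefficient.
sq_sem_20 = s_t_sq * (1 - r20)
sq_sem_21 = s_t_sq * (1 - r21)
gap = Fraction(n * n, n - 1) * s_p_sq                    # n^2 s^2(p) / (n - 1)

print("r20 =", r20, " r21 =", r21, " gap =", gap)
```

Because the gap is a multiple of $s^2(p)$, the two coefficients coincide only when all items in the form have the same observed difficulty; otherwise $r_{21}$ is the smaller.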
Comparison with Certain Standard Formulas

A formula closely related to equation 4 is the following (adapted
from equation 66 of reference (7)):

$$\text{S.E.}(\bar{t}) = \frac{1}{\sqrt{N}}\,s_t\sqrt{1 - \text{reliability}}. \qquad (66)$$

The question arises as to why S.E.$(\bar{t})$ in equation 66 has a
totally different formula from that given in Table 1 for the type 2
standard error of the mean. If we use equation 66 to determine whether
or not two forms of a test yield significantly different mean scores,
we will always find the difference to be significant provided only
that we take a sufficiently large number of examinees ($N$) for our
experiment. This is true because the standard error of equation 66 is
inversely proportional to $\sqrt{N}$; the standard error vanishes when $N$
is large. In spite of this fact, it should be noted that (66) is not
a type 1 standard error. A type 1 standard error involves the sampling
of individuals, whereas only a single group of examinees is contemplated
in (66).
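
The dependence on $N$ can be illustrated numerically. The sketch below evaluates the formula S.E.$(\bar{t}) = s_t\sqrt{1 - \text{reliability}}/\sqrt{N}$ for several group sizes, borrowing $s_t = 21.5$ and a reliability of 0.95 from the earlier illustrative example purely for convenience; the group sizes and the function name are hypothetical.

```python
import math

# Figures borrowed from the earlier 135-item example (used here only as
# convenient illustrative inputs).
s_t = 21.5
reliability = 0.95
se_type2_mean = 2.7   # type 2 standard error of the mean quoted for that example


def se_mean_eq66(num_examinees):
    """Standard error of the mean from equation 66 for a group of the given size."""
    return s_t * math.sqrt(1 - reliability) / math.sqrt(num_examinees)


for num_examinees in (25, 100, 400, 10000):
    print(num_examinees, round(se_mean_eq66(num_examinees), 3))
```

The equation-66 value shrinks without limit as the group grows, whereas the type 2 standard error of 2.7 depends on the number of items and is unaffected by the number of examinees.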

The standard error given in equation 66 represents only the sampling
fluctuation due to those errors of measurement that "average out" when
taken over many individuals. Such errors of measurement arise from
virtually instantaneous "chance" fluctuations in the individual. One
example of such an error of measurement is the following: An examinee,
not knowing the answer to a true-false item, tosses a coin, in effect, to
select the correct answer. If the same test could be administered again
without practice effect, the same examinee would have a fifty-fifty
chance of giving a different answer. This difference gives rise to an
error of measurement of the type under discussion.

The standard error of the mean given in Table 1 includes not only
sampling errors of the sort just mentioned, but also sampling errors
arising from the sampling of the test items.

The line of reasoning applied to equation 66 is equally applicable
to Wilks' (10) and to Votaw's (9) significance tests when either of
these is used as a criterion of "parallelism" in tests, as suggested
by Gulliksen (3, Ch. 14). Gulliksen defines "parallel" tests as having
equal means, equal variances, and equal intercorrelations with each
other and with all external criteria (as well as satisfying appropriate
non-statistical criteria of parallelism). Wilks' and Votaw's significance
tests provide rigorous statistical criteria for "parallelism"
under this definition. It would not be very desirable, however, to
apply Wilks' or Votaw's procedures to data such as were obtained in the
second illustrative example given in a preceding section. If a test
composed of items having a certain characteristic is to be compared with
a test composed of different items having a second characteristic, it
may not be very useful to set up the null hypothesis that the two tests
are strictly interchangeable in every way. Such a null hypothesis will
always be rejected if $N$ is sufficiently large, but the rejection of
this hypothesis does not necessarily imply that the first and second
characteristics have different effect, since the observed discrepancy
might be readily accounted for as no greater than would be expected to
be found in comparing two randomly parallel tests composed of the same
kind of items.
Sampling Distributions of Test Statistics

It remains only to present the derivations of the results that have
up to now been quoted without proof. The derivations are based on the
assertion that there is a definite response ($x_{ia}$) that a given examinee
will make to a given item. The nature of this response may or may not be
known in advance. The group of $N$ examinees to whom the items or tests
are administered is a fixed group not subject to sampling fluctuation or
other changes.

The responses of the $N$ examinees to item $i$ may be specified by
the column vector $x_i = \{x_{i1}, x_{i2}, \ldots, x_{iN}\}$. Since each item response
is assumed to be treated as either "right" or "wrong", $x_{ia} = 0$ or $1$,
and there are exactly $2^N$ possible different vectors, i.e., different
patterns of item response. If we let the subscript $I = 1, 2, 3, \ldots, 2^N$,
then these possible patterns are represented by the $2^N$ vectors $x_I$.
If two items have exactly the same pattern of responses, i.e., if the
response of each examinee is the same on both items, then the two items
are wholly indistinguishable in the present situation. It may therefore
be asserted without loss of generality that, for present purposes, any
infinite pool of items is composed of $2^N$ different kinds of items,
designated by the $2^N$ vectors $x_I$. The relative frequencies of
occurrence of the different kinds of items are therefore the only
parameters needed to describe completely any infinite pool; these
parameters will be denoted by $\pi_I$, the relative frequencies of
occurrence of the patterns $x_I$.


When a random sample of $n$ test items is drawn from the pool,
the probability that the resulting $n$-item test will be composed of
$n_1$ items of the first kind, $n_2$ items of the second kind, ..., $n_I$
items of the $I$-th kind, ..., $n_{(2^N)}$ items of the $2^N$-th kind
is given by the standard multinomial distribution (6, pp. 58-59):

$$f(n_1, n_2, \ldots, n_{(2^N)}) = \frac{n!}{\prod_I n_I!}\,\prod_I \pi_I^{\,n_I}. \qquad (17)$$


It can be shown (1, p. 419) that the quantities $V_I = (n_I - n\pi_I)/\sqrt{n\pi_I}$
are asymptotically normally distributed for large $n$ with zero means
and with the (singular) variance-covariance matrix $\mathbf{I} - \boldsymbol{\pi}\boldsymbol{\pi}'$, where $\mathbf{I}$
is the identity matrix and $\boldsymbol{\pi}$ is the column vector $\{\sqrt{\pi_1}, \sqrt{\pi_2}, \ldots, \sqrt{\pi_{(2^N)}}\}$.


Now, the test score of individual $a$ is $z_a = \frac{1}{n}\sum_i x_{ia} = \frac{1}{n}\sum_I x_{Ia} n_I$, the
$x_{Ia}$ being given constants, 0 or 1, not subject to sampling fluctuation;
or, in terms of $V_I$, $z_a = \sum_I \pi_I x_{Ia} + \frac{1}{\sqrt{n}}\sum_I \sqrt{\pi_I}\, x_{Ia} V_I$. The first term on
the right is $\zeta_a = \tau_a / n$, the "true" proportion-correct score; so that,
finally, $\sqrt{n}(z_a - \zeta_a) = \sum_I \sqrt{\pi_I}\, x_{Ia} V_I$. It is thus seen that the $N$ variables
$\sqrt{n}(z_a - \zeta_a)$ are asymptotically jointly multinormally distributed,
each with a mean of zero, a variance which turns out to be $\zeta_a(1 - \zeta_a)$,
and covariances $\zeta_{ab} - \zeta_a \zeta_b$, where $\zeta_{ab}$ is the proportion of all items
answered correctly by both examinee $a$ and examinee $b$. It follows
immediately that the large-sample standard error of $z_a$ is
$\sqrt{\zeta_a(1 - \zeta_a)/n}$
(cf. (2)). The derivation of these and other standard errors will be
left to the following section, however.
By a well-known theorem, if $f(z_1, z_2, \ldots, z_N)$ is a function of the
$z_a$ having continuous first-order partial derivatives with respect to
each $z_a$ at the point $(\zeta_1, \zeta_2, \ldots, \zeta_N)$, and if at least one of these
derivatives is nonvanishing at this point, then the entity
$\sqrt{n}\,[f(z_1, z_2, \ldots, z_N) - f(\zeta_1, \zeta_2, \ldots, \zeta_N)]$ is asymptotically normally distributed
with zero mean when $n$ is sufficiently large. This theorem assures us
that the mean score ($\bar{z}$ or $\bar{t}$), the standard deviation of the scores
($s_z$ or $s_t$), the Kuder-Richardson formula 21 reliability ($r_{21}$),
and the test validity ($r_{cz}$ or $r_{ct}$), are approximately normally
distributed in type 2 sampling with large $n$; and in addition gives us
the large-sample expected value of each statistic. It seems highly
likely that the Kuder-Richardson reliability, formula 20, likewise is
asymptotically normally distributed, but no proof of this conclusion
is available at present, in view of the fact that the formula for this
statistic involves $s(p)$, which is not a function of the $z_a$.
Derivations of Expected Values and Standard Errors

The Individual Score

The proportion of the items in the entire pool to which examinee
$a$ will give the correct answer is, by definition, $\zeta_a = \tau_a / n$. If
$n$ items are drawn at random from the pool, $t_a$, the score of examinee
$a$ on the resulting test, i.e., the number of items that he will answer
successfully, will of necessity have the usual binomial distribution*
with mean and variance

$$E(t_a) = \tau_a, \qquad (18)$$

$$\text{S.E.}^2(t_a) = \frac{1}{n}\,\tau_a(n - \tau_a) = n\zeta_a(1 - \zeta_a). \qquad (19)$$

This conclusion (and also those that follow, except as large $n$ may be
$$\text{S.E.}^2(t_a) = \frac{m-n}{m-1}\cdot\frac{1}{n}\,\tau_a(n - \tau_a). \qquad (19')$$


The Mean Score of the Group Tested

It should be noted that the scores of examinees $a$ and $b$ are
not independent over different parallel forms of the test. If a
particular form happens to be composed of rather difficult items, both
examinees will tend to get low scores; if a particular form happens to
be easy, both will tend to score higher. Consequently, although the
expected value of the mean score in the group is equal to the mean of
the expected values of the individual scores, i.e.,

$$E(\bar{t}) = \bar{\tau}, \qquad (20)$$

the standard error of the mean is not an average of the standard errors
of the individual scores.

It will be convenient from this point on to work with $z_a = t_a/n$ ,
the proportion-correct score, rather than with $t_a$ itself. The nature
of the desired standard error follows immediately from the fact that
the mean score ( $\bar{z}$ ) is identically equal to the average item difficulty:

$$\bar{z} = M(p) . \qquad (21)$$

The usual formulas for the standard error of a mean apply to $M(p)$ ,
so that

$$\mathrm{S.E.}^2(\bar{z}) = \frac{1}{n}\,\sigma^2(p) \qquad (22)$$

where $\sigma(p)$ is the standard deviation of the item difficulties over
the whole pool of items.* If the observed value of $s(p)$ is substituted
for the unknown $\sigma(p)$ , and if t/n is substituted for z , the
square of the second formula of Table 1 is obtained.
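The computation implied by equations 21 and 22 can be sketched in a few lines of Python. This is only an illustration: the function name and the small 0/1 response matrix are invented, and the observed $s(p)$ (computed with divisor n) is used in place of $\sigma(p)$ as the text suggests.

```python
import math
from statistics import fmean, pvariance


def mean_score_standard_error(responses):
    """Return (z_bar, S.E.(z_bar)) for a 0/1 response matrix.

    responses[a][i] is 1 if examinee a answered item i correctly, else 0.
    The standard error follows equation 22 with s(p) replacing sigma(p).
    """
    n_items = len(responses[0])
    # Item difficulty p_i: proportion of examinees answering item i correctly.
    p = [fmean(row[i] for row in responses) for i in range(n_items)]
    z_bar = fmean(p)  # equation 21: mean score equals mean item difficulty
    return z_bar, math.sqrt(pvariance(p) / n_items)


# Hypothetical data: 4 examinees (rows) by 4 items (columns).
toy_responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
]
z_bar, se = mean_score_standard_error(toy_responses)
print(z_bar, se)
```

Averaging the examinees' proportion-correct scores gives the same $\bar{z}$ as averaging the item difficulties, which is the identity that equation 21 relies on.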

In sampling from a finite pool of m items, the corresponding
formula, stated without proof, is

$$\mathrm{S.E.}^2(\bar{z}) = \frac{m-n}{mn}\,\sigma^2(p) . \qquad (22')$$

We may note that $\sigma(p)$ for a given set of items, and hence
$\mathrm{S.E.}^2(\bar{z})$ for a given test, will be higher when N is small than when
N is large. Suppose, for example, that all items have the same difficulty
( p ) for a very large group of examinees, so that for this
group $\sigma(p) = 0$ . If the same items are administered to a smaller
group of examinees drawn at random from the larger, the observed values
of $p_i$ in the smaller group will differ from each other because of
type 1 sampling fluctuations, and $\sigma(p)$ will be greater than zero.
In the extreme case where N = 1 , the observed values of p are of
necessity either 0 or 1, and $\sigma(p)$ is at a maximum.
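The effect of group size on the observed spread of item difficulties can be illustrated by a small simulation. The sketch below assumes, purely for illustration, that every item has the same true difficulty and that responses are independent Bernoulli trials; the function name, seed, and parameter values are hypothetical.

```python
import random
from statistics import pstdev


def observed_difficulties(n_examinees, n_items, true_p, seed=0):
    """Simulate observed item difficulties when all items share one true difficulty.

    Assumption (not from the source): each response is an independent
    Bernoulli trial with success probability true_p.
    """
    rng = random.Random(seed)
    difficulties = []
    for _ in range(n_items):
        correct = sum(rng.random() < true_p for _ in range(n_examinees))
        difficulties.append(correct / n_examinees)
    return difficulties


# Observed s(p) for groups of increasing size N, same true difficulty throughout.
for group_size in (1, 10, 1000):
    p = observed_difficulties(group_size, n_items=50, true_p=0.6)
    print(group_size, round(pstdev(p), 3))
```

With N = 1 every simulated difficulty is exactly 0 or 1, while for large N the observed difficulties cluster near the common true value, so the printed standard deviation tends to shrink as N grows.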


Equation 19 is a special case of equation 22, being obtained when
$p_i = x_{ia}$ .
The Standard Deviation of the Scores of the Group Tested

In order to obtain the standard error of $s_z^2$ , we first use the
formula for the variance of a sum to write

$$s_z^2 = \frac{1}{n^2}\sum_h \sum_i s_{hi} , \qquad (23)$$

$s_{hi}$ being the covariance between item i and item h . Then, again
from the formula for the variance of a sum,

$$\mathrm{Var}\, s_z^2 = \frac{1}{n^4}\sum_h \sum_i \sum_j \sum_k \mathrm{Cov}(s_{hi}, s_{jk}) , \qquad (24)$$

where "Cov" stands for the sampling covariance:
$\mathrm{Cov}(s_{hi}, s_{jk}) = E\,s_{hi}s_{jk} - E\,s_{hi}\,E\,s_{jk}$ .

Grouping the sums in (24), we obtain

$$\mathrm{Var}\, s_z^2 = \frac{1}{n^4}\Bigg[\ \overset{n^4-6n^3+11n^2-6n}{\sum_{h \ne i \ne j \ne k}} \mathrm{Cov}(s_{hi}, s_{jk}) + 2 \overset{n^3-3n^2+2n}{\sum_{i \ne j \ne k}} \mathrm{Cov}(s_i^2, s_{jk}) + 4 \overset{n^3-3n^2+2n}{\sum_{i \ne j \ne k}} \mathrm{Cov}(s_{ij}, s_{ik}) + 4 \overset{n^2-n}{\sum_{i \ne j}} \mathrm{Cov}(s_i^2, s_{ij}) + \text{other sums containing no more than } n^2 \text{ terms each} \Bigg] . \qquad (25)$$

Here the first sum is over all sets of four subscripts no two of which
are the same, etc. The coefficient 2 of the second sum arises from
combining the two equivalent expressions
$\sum_{i \ne j \ne k} \mathrm{Cov}(s_i^2, s_{jk})$ and $\sum_{h \ne i \ne j} \mathrm{Cov}(s_{hi}, s_j^2)$ .
The other numerical coefficients arise similarly.
The polynomials in n written above the summation signs indicate the
number of terms involved in the summation.

Now, the terms under each summation sign in (25) are all the same
no matter what the numerical values of the subscripts; consequently

$$\mathrm{Var}\, s_z^2 = \frac{1}{n^4}\Big[ (n^4 - 6n^3 + 11n^2 - 6n)\,\mathrm{Cov}(s_{hi}, s_{jk}) + 2(n^3 - 3n^2 + 2n)\,\mathrm{Cov}(s_i^2, s_{jk}) + 4(n^3 - 3n^2 + 2n)\,\mathrm{Cov}(s_{ij}, s_{ik}) + O(n^2) \Big] , \qquad (26)$$

where $O(n^2)$ stands for terms of order $n^2$ . In (26) and in the following
paragraph it is understood that $h \ne i \ne j \ne k$ .

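Now, $s_{hi}$ and $s_{jk}$ fluctuate independently over successive samples,
so that $\mathrm{Cov}(s_{hi}, s_{jk}) = 0$ . The same is true of $s_i^2$ and $s_{jk}$ .
Consequently,

$$\mathrm{Var}\, s_z^2 = \frac{4}{n^4}\,(n^3 - 3n^2 + 2n)\,\mathrm{Cov}(s_{ij}, s_{ik}) + O\!\left(\frac{1}{n^2}\right) . \qquad (27)$$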


Equation 27 gives the desired result, but not in a very useful form,
since $\mathrm{Cov}(s_{ij}, s_{ik})$ is a function of population parameters and is
generally not known. As a final step, then, it will be shown that $s^2(s_{iz})$ ,
the actual variance (over items 1 to n ) of the observed item-test
covariances, provides a "consistent" estimate of $\mathrm{Cov}(s_{ij}, s_{ik})$ , i.e.,
it will be proved that

$$E\,s^2(s_{iz}) = \mathrm{Cov}(s_{ij}, s_{ik}) + O\!\left(\frac{1}{n}\right) . \qquad (28)$$


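From the formula for the covariance of a sum,

$$s_{iz} = \frac{1}{n}\sum_j s_{ij} , \qquad (29)$$

$$s^2(s_{iz}) = \frac{1}{n^2}\sum_j \sum_k s(s_{ij}, s_{ik}) , \qquad (30)$$

the term under the summation sign being the actual covariance (over
items 1 to n ) of the observed values of $s_{ij}$ and $s_{ik}$ :

$$s(s_{ij}, s_{ik}) = \frac{1}{n}\sum_i s_{ij}s_{ik} - \Big(\frac{1}{n}\sum_i s_{ij}\Big)\Big(\frac{1}{n}\sum_i s_{ik}\Big) . \qquad (31)$$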


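Substituting from (31) into (30), and taking expected values, we obtain

$$E\,s^2(s_{iz}) = \frac{1}{n^3}\sum_i \sum_j \sum_k E\,s_{ij}s_{ik} - \frac{1}{n^4}\sum_h \sum_i \sum_j \sum_k E\,s_{hj}s_{ik} . \qquad (32)$$

Grouping the sums on the right, we have

$$E\,s^2(s_{iz}) = \frac{1}{n^3}\Bigg[\ \overset{n(n-1)(n-2)}{\sum_{i \ne j \ne k}} E\,s_{ij}s_{ik} + O(n^2) \Bigg] - \frac{1}{n^4}\Bigg[\ \overset{n(n-1)(n-2)(n-3)}{\sum_{h \ne i \ne j \ne k}} E\,s_{hj}s_{ik} + O(n^3) \Bigg] . \qquad (33)$$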


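Now, the terms under each summation sign in (33) are the same regardless
of the numerical value of the subscript. Furthermore, as already
pointed out in deriving (27), $\mathrm{Cov}(s_{hj}, s_{ik}) = 0$ when $h \ne i \ne j \ne k$ , or in
other words, $E\,s_{hj}s_{ik} - E\,s_{hj}\,E\,s_{ik} = 0$ , or $E\,s_{hj}s_{ik} = E\,s_{hj}\,E\,s_{ik}$ .
Consequently,

$$E\,s^2(s_{iz}) = E\,s_{ij}s_{ik} - E\,s_{ij}\,E\,s_{ik} + O\!\left(\frac{1}{n}\right) = \mathrm{Cov}(s_{ij}, s_{ik}) + O\!\left(\frac{1}{n}\right) . \qquad (34)$$

But this is the same as (28), which was to be proved.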

The large-sample standard error of $s_z^2$ may therefore be estimated
from the actual variance of the observed item-test covariances:

$$\mathrm{S.E.}^2(s_z^2) = \frac{4}{n}\, s^2(s_{iz}) . \qquad (35)$$


By means of the "delta" method (4, Vol. 1, 208 ff.), it is
readily shown from (35) that in large samples

$$\mathrm{S.E.}^2(s_z) \approx \frac{1}{4 s_z^2}\,\mathrm{S.E.}^2(s_z^2) = \frac{s^2(s_{iz})}{n\, s_z^2} . \qquad (36)$$

If t/n is substituted for z in (36), the square of the third
equation of Table 1 is obtained.
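A minimal Python sketch of the computation may help fix ideas. It assumes that equations 35 and 36 read $\mathrm{S.E.}^2(s_z^2) = 4 s^2(s_{iz})/n$ and $\mathrm{S.E.}^2(s_z) \approx s^2(s_{iz})/(n s_z^2)$ , with all variances and covariances taken with divisor N or n ("actual" values). The function names and the toy response matrix are hypothetical.

```python
import math
from statistics import fmean, pstdev, pvariance


def item_covariances(responses):
    """Matrix of covariances s_ij between item scores, taken over examinees."""
    n_items = len(responses[0])
    columns = [[row[i] for row in responses] for i in range(n_items)]
    means = [fmean(col) for col in columns]
    return [
        [fmean(x * y for x, y in zip(columns[i], columns[j])) - means[i] * means[j]
         for j in range(n_items)]
        for i in range(n_items)
    ]


def sd_with_standard_error(responses):
    """Return (s_z, S.E.(s_z)) under the assumed reading of equations 35 and 36."""
    n = len(responses[0])
    s = item_covariances(responses)
    s_iz = [sum(row) / n for row in s]   # item-test covariances
    s_z2 = sum(s_iz) / n                 # variance of z as (1/n^2) * sum of all s_ij
    if s_z2 <= 0:
        raise ValueError("all examinees have the same score; s_z is zero")
    se_variance = math.sqrt(4 * pvariance(s_iz) / n)   # equation 35
    se_sd = se_variance / (2 * math.sqrt(s_z2))        # equation 36
    return math.sqrt(s_z2), se_sd


# Hypothetical data: 4 examinees (rows) by 4 items (columns).
toy_responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
]
s_z, se_sd = sd_with_standard_error(toy_responses)
z_scores = [fmean(row) for row in toy_responses]
print(s_z, pstdev(z_scores), se_sd)
```

The first two printed values agree, since the standard deviation built from the item covariance matrix is the ordinary standard deviation of the proportion-correct scores.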

The corresponding squared standard error for sampling from finite
populations may be shown to be

$$\mathrm{S.E.}^2(s_z^2) = 4\,\frac{m-n}{mn}\, s^2(s_{iz}) . \qquad (37)$$


The Kuder-Richardson Reliability Coefficient, Formula 20

Let the usual formula for $r_{20}$ , the Kuder-Richardson formula 20,
be rewritten as follows:

$$r_{20} = \frac{n}{n-1}\left(1 - \frac{R}{n}\right) , \qquad (38)$$

where $R = \frac{1}{n}\sum_i s_i^2 \big/ s_z^2 = M/s_z^2$ , say.
In the extraordinary case where $s_z^2 = 0$ , we will agree not to try to
compute any value of $r_{20}$ . The "delta" method may now be used to
obtain the result,


$$\mathrm{Var}\, R = \frac{1}{s_z^4}\,\mathrm{Var}\, M + \frac{M^2}{s_z^8}\,\mathrm{Var}\, s_z^2 - \frac{2M}{s_z^6}\,\mathrm{Cov}(M, s_z^2) . \qquad (39)$$


Now $\mathrm{Var}(s_z^2)$ is already known from equation 35. $\mathrm{Var}(M)$ can be evaluated
by the usual formula for the standard error of a mean:

$$\mathrm{Var}\, M = \frac{1}{n}\, s^2(s_i^2) , \qquad (40)$$

where $s^2(s_i^2)$ is the actual variance of the observed item variances.
Finally, it is readily shown, by methods similar to those used in evaluating
$\mathrm{Var}(s_z^2)$ , that

$$\mathrm{Cov}(M, s_z^2) = \frac{2}{n}\, s(s_i^2, s_{iz}) , \qquad (41)$$
where $s(s_i^2, s_{iz})$ is the actual covariance between the observed item
variances and the observed item-test covariances.

Consequently,

$$\mathrm{Var}\, R = \frac{1}{n s_z^4}\left[ s^2(s_i^2) + 4R^2 s^2(s_{iz}) - 4R\, s(s_i^2, s_{iz}) \right] . \qquad (42)$$

Now $\mathrm{Var}(r_{20}) = \mathrm{Var}(R)/(n-1)^2$ ; hence, to order $1/n^3$ ,

$$\mathrm{S.E.}^2(r_{20}) = \frac{1}{n^3 s_z^4}\left[ s^2(s_i^2) + 4n^2(1 - r_{20})^2 s^2(s_{iz}) - 4n(1 - r_{20})\, s(s_i^2, s_{iz}) \right] . \qquad (43)$$

It may be noted that the quantity $(1 - r_{20})$ is of order 1/n , because,
by the Spearman-Brown formula, $\lim_{n \to \infty} n(1 - r_{20}) = \text{constant}$ . It is then
seen from (43) that $\mathrm{S.E.}^2(r_{20})$ is a quantity of order $1/n^3$ . Equation
43 leads directly to the fourth formula of Table 1.

It  may  be  shown  that  the  corresponding  standard  error  when  sampling 
from  a finite  population  is  (m  - n)/m  times  the  value  given  in  (43). 
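The following Python sketch shows one way the formula-20 coefficient and its item-sampling standard error might be computed from a 0/1 response matrix. It rests on two assumptions that should be checked against a clean copy of the paper: that $R = M/s_z^2$ with M the mean item variance, so that $r_{20} = (n - R)/(n - 1)$ , and that equation 43 reads $\mathrm{S.E.}^2(r_{20}) = [\,s^2(s_i^2) + 4n^2(1 - r_{20})^2 s^2(s_{iz}) - 4n(1 - r_{20})\, s(s_i^2, s_{iz})\,]/(n^3 s_z^4)$ . All function names and the toy data are hypothetical.

```python
import math
from statistics import fmean, pvariance


def pcov(x, y):
    """Actual (divisor-n) covariance between two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    return fmean((a - mx) * (b - my) for a, b in zip(x, y))


def kr20_with_standard_error(responses):
    """Return (r20, S.E.(r20)) under the assumed reading of equation 43."""
    n = len(responses[0])
    columns = [[row[i] for row in responses] for i in range(n)]
    z = [fmean(row) for row in responses]          # proportion-correct scores
    s_z2 = pvariance(z)
    if s_z2 == 0:
        raise ValueError("r20 is not computed when s_z^2 = 0")
    item_var = [pvariance(col) for col in columns]  # s_i^2
    s_iz = [pcov(col, z) for col in columns]        # item-test covariances
    R = fmean(item_var) / s_z2                      # R = M / s_z^2
    r20 = (n - R) / (n - 1)                         # same as n/(n-1) * (1 - R/n)
    c = n * (1 - r20)
    bracket = (pvariance(item_var)
               + 4 * c ** 2 * pvariance(s_iz)
               - 4 * c * pcov(item_var, s_iz))
    # The bracket equals the variance of (s_i^2 - 2c s_iz), so it cannot be
    # negative except through rounding; max() guards against that.
    return r20, math.sqrt(max(bracket, 0.0) / (n ** 3 * s_z2 ** 2))


# Hypothetical data: 4 examinees (rows) by 4 items (columns).
toy_responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
]
r20, se_r20 = kr20_with_standard_error(toy_responses)
print(r20, se_r20)
```

A test of four items is far too short for a large-sample formula to be trusted; the data serve only to show the arithmetic.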


By a procedure wholly parallel to that used for the formula-20 reliability
coefficient, it is found that, approximately,

$$\mathrm{S.E.}^2(r_{21}) = \frac{1}{n^3 s_z^4}\left[ (1 - 2\bar{z})^2 s^2(p) + 4n^2(1 - r_{21})^2 s^2(s_{iz}) - 4n(1 - r_{21})(1 - 2\bar{z})\, s(p_i, s_{iz}) \right] , \qquad (44)$$

where $s(p_i, s_{iz})$ is the actual covariance between the observed item
difficulties and the observed item-test covariances. Equation 44 leads
directly to the fifth formula of Table 1.
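A parallel sketch for the formula-21 coefficient is given below in Python. It assumes the standard Kuder-Richardson formula 21 expressed in proportion-correct terms, $r_{21} = \frac{n}{n-1}\left[1 - \bar{z}(1 - \bar{z})/(n s_z^2)\right]$ , and assumes that equation 44 reads $\mathrm{S.E.}^2(r_{21}) = [\,(1 - 2\bar{z})^2 s^2(p) + 4n^2(1 - r_{21})^2 s^2(s_{iz}) - 4n(1 - r_{21})(1 - 2\bar{z})\, s(p_i, s_{iz})\,]/(n^3 s_z^4)$ . Function names and data are hypothetical.

```python
import math
from statistics import fmean, pvariance


def pcov(x, y):
    """Actual (divisor-n) covariance between two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    return fmean((a - mx) * (b - my) for a, b in zip(x, y))


def kr21_with_standard_error(responses):
    """Return (r21, S.E.(r21)) under the assumed reading of equation 44."""
    n = len(responses[0])
    columns = [[row[i] for row in responses] for i in range(n)]
    z = [fmean(row) for row in responses]      # proportion-correct scores
    z_bar = fmean(z)
    s_z2 = pvariance(z)
    if s_z2 == 0:
        raise ValueError("r21 is not computed when s_z^2 = 0")
    p = [fmean(col) for col in columns]        # item difficulties p_i
    s_iz = [pcov(col, z) for col in columns]   # item-test covariances
    r21 = (n / (n - 1)) * (1 - z_bar * (1 - z_bar) / (n * s_z2))
    c = n * (1 - r21)
    d = 1 - 2 * z_bar
    bracket = (d ** 2 * pvariance(p)
               + 4 * c ** 2 * pvariance(s_iz)
               - 4 * c * d * pcov(p, s_iz))
    # The bracket is the variance of (d * p_i - 2c * s_iz); max() only
    # protects against a tiny negative value caused by rounding.
    return r21, math.sqrt(max(bracket, 0.0) / (n ** 3 * s_z2 ** 2))


# Hypothetical data: 4 examinees (rows) by 4 items (columns).
toy_responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
]
r21, se_r21 = kr21_with_standard_error(toy_responses)
print(r21, se_r21)
```

Only the item difficulties and item-test covariances enter the bracket, whereas the formula-20 standard error also requires the item variances.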

The standard error of the split-half reliability coefficient has
not been worked out. It must, however, be larger than the standard
error of $r_{20}$ , given by (43), since $r_{20}$ is the mean of the split-half
coefficients from all possible splits, as shown by Cronbach (2).

The  Validity  Coefficient 

If  c is  an  outside  criterion. 
By the delta method,

It is found that

Finally,

Equation leads directly to the last formula of Table 1.

The corresponding standard error for sampling from a finite pool
of items is presumably (m - n)/m times the foregoing quantity.

